Introduction
Of all the applications of machine-learning, diagnosing any serious disease using a black box is always going to be a hard sell. If the output from a model is the particular course of treatment (potentially with side-effects), or surgery, or the absence of treatment, people are going to want to know why.
This dataset gives a number of variables along with a target condition of having or not having heart disease. Below, the data is first tried to be predicted using a losgistic regression model, then a random forest model. However, decision tree is selcted as the mode appropriate model.
Data Description age: The person’s age in years
sex: The person’s sex (1 = male, 0 = female)
cp: The chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)
trestbps: The person’s resting blood pressure (mm Hg on admission to the hospital)
chol: The person’s cholesterol measurement in mg/dl
fbs: The person’s fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes’ criteria)
thalach: The person’s maximum heart rate achieved
exang: Exercise induced angina (1 = yes; 0 = no)
oldpeak: ST depression induced by exercise relative to rest (‘ST’ relates to positions on the ECG plot. See more here)
slope: the slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)
ca: The number of major vessels (0-3)
thal: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversable defect)
target: Heart disease (0 = no, 1 = yes)
Presentation
library(tidyverse)
library(readr)
library(ROCR)
library(PerformanceAnalytics)
library(e1071)
library(caret)
library(gbm)
library(corrplot)
library(ggcorrplot)
library(MASS)
library(rpart)
library(caTools)
library(naivebayes)
library(class)
library(ISLR)
library(glmnet)
library(Hmisc)
library(funModeling)
library(pROC)
library(randomForest)
library(klaR)
library(scales)
library(cluster)
library(factoextra)
library(DataExplorer)
library(ClustOfVar)
library(GGally)
library(gmodels)
library(C50)
# reading the data
heart <- read.csv("heart.csv")
str(heart)
## 'data.frame': 303 obs. of 14 variables:
## $ ï..age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : int 1 1 0 1 0 1 0 1 1 1 ...
## $ cp : int 3 2 1 1 0 0 1 1 2 2 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : int 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : int 1 2 2 2 2 1 2 3 3 2 ...
## $ target : int 1 1 1 1 1 1 1 1 1 1 ...
summary(heart)
## ï..age sex cp trestbps
## Min. :29.00 Min. :0.0000 Min. :0.000 Min. : 94.0
## 1st Qu.:47.50 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:120.0
## Median :55.00 Median :1.0000 Median :1.000 Median :130.0
## Mean :54.37 Mean :0.6832 Mean :0.967 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :3.000 Max. :200.0
## chol fbs restecg thalach
## Min. :126.0 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:133.5
## Median :240.0 Median :0.0000 Median :1.0000 Median :153.0
## Mean :246.3 Mean :0.1485 Mean :0.5281 Mean :149.6
## 3rd Qu.:274.5 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:166.0
## Max. :564.0 Max. :1.0000 Max. :2.0000 Max. :202.0
## exang oldpeak slope ca
## Min. :0.0000 Min. :0.00 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.80 Median :1.000 Median :0.0000
## Mean :0.3267 Mean :1.04 Mean :1.399 Mean :0.7294
## 3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.20 Max. :2.000 Max. :4.0000
## thal target
## Min. :0.000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:0.0000
## Median :2.000 Median :1.0000
## Mean :2.314 Mean :0.5446
## 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :3.000 Max. :1.0000
Methodology
Results and Discussion
## Here we are cleaning the data set and modifying the factors
data2 <- heart %>%
mutate(sex = if_else(sex == 1, "MALE", "FEMALE"),
fbs = if_else(fbs == 1, ">120", "<=120"),
exang = if_else(exang == 1, "YES" ,"NO"),
cp = if_else(cp == 1, "ATYPICAL ANGINA",
if_else(cp == 2, "NON-ANGINAL PAIN", "ASYMPTOMATIC")),
restecg = if_else(restecg == 0, "NORMAL",
if_else(restecg == 1, "ABNORMALITY", "PROBABLE OR DEFINITE")),
slope = as.factor(slope),
ca = as.factor(ca),
thal = as.factor(thal),
target = if_else(target == 1, "YES", "NO")
) %>%
mutate_if(is.character, as.factor) %>%
dplyr::select(target, sex, fbs, exang, cp, restecg, slope, ca, thal, everything())
summary(data2)
## target sex fbs exang cp
## NO :138 FEMALE: 96 <=120:258 NO :204 ASYMPTOMATIC :166
## YES:165 MALE :207 >120 : 45 YES: 99 ATYPICAL ANGINA : 50
## NON-ANGINAL PAIN: 87
##
##
##
## restecg slope ca thal ï..age
## ABNORMALITY :152 0: 21 0:175 0: 2 Min. :29.00
## NORMAL :147 1:140 1: 65 1: 18 1st Qu.:47.50
## PROBABLE OR DEFINITE: 4 2:142 2: 38 2:166 Median :55.00
## 3: 20 3:117 Mean :54.37
## 4: 5 3rd Qu.:61.00
## Max. :77.00
## trestbps chol thalach oldpeak
## Min. : 94.0 Min. :126.0 Min. : 71.0 Min. :0.00
## 1st Qu.:120.0 1st Qu.:211.0 1st Qu.:133.5 1st Qu.:0.00
## Median :130.0 Median :240.0 Median :153.0 Median :0.80
## Mean :131.6 Mean :246.3 Mean :149.6 Mean :1.04
## 3rd Qu.:140.0 3rd Qu.:274.5 3rd Qu.:166.0 3rd Qu.:1.60
## Max. :200.0 Max. :564.0 Max. :202.0 Max. :6.20
logit.mode1 <- glm( target ~ . , family = "binomial", data = data2)
summary(logit.mode1)
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = data2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8019 -0.3504 0.1406 0.4323 2.9873
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.799378 3.176635 -0.252 0.80132
## sexMALE -1.441798 0.521275 -2.766 0.00568 **
## fbs>120 0.627653 0.555851 1.129 0.25882
## exangYES -1.017763 0.429158 -2.372 0.01771 *
## cpATYPICAL ANGINA 0.387536 0.558853 0.693 0.48803
## cpNON-ANGINAL PAIN 1.350664 0.472046 2.861 0.00422 **
## restecgNORMAL -0.370497 0.383120 -0.967 0.33352
## restecgPROBABLE OR DEFINITE -1.322756 2.451173 -0.540 0.58944
## slope1 -0.551347 0.796351 -0.692 0.48872
## slope2 0.668520 0.885995 0.755 0.45052
## ca1 -2.215448 0.498913 -4.441 8.97e-06 ***
## ca2 -3.164688 0.738587 -4.285 1.83e-05 ***
## ca3 -2.542796 0.903922 -2.813 0.00491 **
## ca4 0.890341 1.608751 0.553 0.57996
## thal1 2.249101 2.103214 1.069 0.28491
## thal2 2.426802 2.010214 1.207 0.22734
## thal3 0.924743 2.017278 0.458 0.64666
## ï..age 0.031697 0.024565 1.290 0.19694
## trestbps -0.018380 0.011327 -1.623 0.10465
## chol -0.005126 0.004034 -1.271 0.20382
## thalach 0.022643 0.010979 2.062 0.03918 *
## oldpeak -0.272669 0.233259 -1.169 0.24242
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 417.64 on 302 degrees of freedom
## Residual deviance: 192.37 on 281 degrees of freedom
## AIC: 236.37
##
## Number of Fisher Scoring iterations: 6
# when men have the heart attack, men come to ED with pain
# while women have symptoms that are not nominally associated with heart attack
# moreover gender bias by physician plays a role
# blood sugar >120 is more prone
# old peak is not a sig variable
logit.mode2 <- glm( target ~ sex*cp + . , family = "binomial", data = data2)
summary(logit.mode2)
##
## Call:
## glm(formula = target ~ sex * cp + ., family = "binomial", data = data2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.68994 -0.38316 0.07254 0.40307 2.99164
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.402996 3.582727 -0.671 0.50240
## sexMALE -0.864383 0.633699 -1.364 0.17256
## cpATYPICAL ANGINA 0.909081 0.972810 0.934 0.35005
## cpNON-ANGINAL PAIN 3.439028 1.382793 2.487 0.01288 *
## fbs>120 0.685497 0.564049 1.215 0.22425
## exangYES -0.890768 0.434208 -2.051 0.04022 *
## restecgNORMAL -0.411750 0.389731 -1.057 0.29074
## restecgPROBABLE OR DEFINITE -2.069213 3.794360 -0.545 0.58552
## slope1 -0.514944 0.792875 -0.649 0.51604
## slope2 0.612706 0.891511 0.687 0.49191
## ca1 -2.339835 0.518361 -4.514 6.36e-06 ***
## ca2 -3.079886 0.735950 -4.185 2.85e-05 ***
## ca3 -2.378054 0.911009 -2.610 0.00904 **
## ca4 0.791737 1.568229 0.505 0.61366
## thal1 3.104065 2.579220 1.203 0.22879
## thal2 3.437036 2.518750 1.365 0.17239
## thal3 1.825456 2.504979 0.729 0.46617
## ï..age 0.032036 0.024723 1.296 0.19505
## trestbps -0.017222 0.011413 -1.509 0.13131
## chol -0.006397 0.004385 -1.459 0.14457
## thalach 0.024882 0.011096 2.242 0.02493 *
## oldpeak -0.260573 0.234493 -1.111 0.26647
## sexMALE:cpATYPICAL ANGINA -0.647946 1.170074 -0.554 0.57974
## sexMALE:cpNON-ANGINAL PAIN -2.503991 1.471032 -1.702 0.08872 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 417.64 on 302 degrees of freedom
## Residual deviance: 188.73 on 279 degrees of freedom
## AIC: 236.73
##
## Number of Fisher Scoring iterations: 6
# the lower the AIC the better your model, the right direction in lower AIC
#split into test and train
set.seed(12345)
data2_rand <- data2[order(runif(303)), ]
# the 303 number is the number of rows
summary(data2$trestbps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 94.0 120.0 130.0 131.6 140.0 200.0
summary(data2_rand$trestbps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 94.0 120.0 130.0 131.6 140.0 200.0
str(data2)
## 'data.frame': 303 obs. of 14 variables:
## $ target : Factor w/ 2 levels "NO","YES": 2 2 2 2 2 2 2 2 2 2 ...
## $ sex : Factor w/ 2 levels "FEMALE","MALE": 2 2 1 2 1 2 1 2 2 2 ...
## $ fbs : Factor w/ 2 levels "<=120",">120": 2 1 1 1 1 1 1 1 2 1 ...
## $ exang : Factor w/ 2 levels "NO","YES": 1 1 1 1 2 1 1 1 1 1 ...
## $ cp : Factor w/ 3 levels "ASYMPTOMATIC",..: 1 3 2 2 1 1 2 2 3 3 ...
## $ restecg : Factor w/ 3 levels "ABNORMALITY",..: 2 1 2 1 1 1 2 1 1 1 ...
## $ slope : Factor w/ 3 levels "0","1","2": 1 1 3 3 3 2 2 3 3 3 ...
## $ ca : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ thal : Factor w/ 4 levels "0","1","2","3": 2 3 3 3 3 2 3 4 4 3 ...
## $ ï..age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
set.seed(300)
data2_train <- data2_rand[1:240, ]
data2_test <- data2_rand[241:303, ]
rf <- randomForest(target ~ ., data = data2_train)
rf
##
## Call:
## randomForest(formula = target ~ ., data = data2_train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 19.58%
## Confusion matrix:
## NO YES class.error
## NO 84 24 0.2222222
## YES 23 109 0.1742424
data2_pred_rf <- predict(rf, data2_test)
CrossTable(data2_test$target, data2_pred_rf,
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
dnn = c('actual heart disease', 'predicted heart disease'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 63
##
##
## | predicted heart disease
## actual heart disease | NO | YES | Row Total |
## ---------------------|-----------|-----------|-----------|
## NO | 24 | 6 | 30 |
## | 0.381 | 0.095 | |
## ---------------------|-----------|-----------|-----------|
## YES | 4 | 29 | 33 |
## | 0.063 | 0.460 | |
## ---------------------|-----------|-----------|-----------|
## Column Total | 28 | 35 | 63 |
## ---------------------|-----------|-----------|-----------|
##
##
accuracy1=(32+22)/(63)
accuracy1
## [1] 0.8571429
# this model is 85% accurate
# this model accuatately predicted the presence of heart disease 85 percent of the time
# we have 3/63 false negatives
# try decision tree to det imp factor
# try without thal, thalach, ca, oldpeak, slope
# everyone gets ekg, fasting blood sugar, cholestorol
data2_dt <- C5.0(data2_train[-1], data2_train$target)
#plot will not help but try
plot(data2_dt)
# we can see the most critical factor is checking balance
# display simple facts about the tree
data2_dt
##
## Call:
## C5.0.default(x = data2_train[-1], y = data2_train$target)
##
## Classification Tree
## Number of samples: 240
## Number of predictors: 13
##
## Tree size: 17
##
## Non-standard options: attempt to group attributes
# display detailed information about the tree
summary(data2_dt)
##
## Call:
## C5.0.default(x = data2_train[-1], y = data2_train$target)
##
##
## C5.0 [Release 2.07 GPL Edition] Sun Apr 19 23:25:52 2020
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 240 cases (14 attributes) from undefined.data
##
## Decision tree:
##
## thal = 2:
## :...ca in {0,4}:
## : :...oldpeak <= 1.6: YES (83/4)
## : : oldpeak > 1.6:
## : : :...ï..age <= 61: NO (3)
## : : ï..age > 61: YES (2)
## : ca in {1,2,3}:
## : :...cp = NON-ANGINAL PAIN: YES (11/1)
## : cp = ASYMPTOMATIC:
## : :...sex = MALE: NO (16/3)
## : : sex = FEMALE:
## : : :...trestbps <= 134: YES (3)
## : : trestbps > 134: NO (2)
## : cp = ATYPICAL ANGINA:
## : :...restecg in {ABNORMALITY,PROBABLE OR DEFINITE}: YES (3)
## : restecg = NORMAL:
## : :...exang = NO: NO (3)
## : exang = YES: YES (2)
## thal in {0,1,3}:
## :...oldpeak > 0.7: NO (74/9)
## oldpeak <= 0.7:
## :...ca = 4: YES (0)
## ca in {1,2,3}: NO (14/4)
## ca = 0:
## :...cp in {ATYPICAL ANGINA,NON-ANGINAL PAIN}: YES (11/1)
## cp = ASYMPTOMATIC:
## :...chol > 237: NO (4)
## chol <= 237:
## :...ï..age <= 42: NO (2)
## ï..age > 42: YES (7)
##
##
## Evaluation on training data (240 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 16 22( 9.2%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 102 6 (a): class NO
## 16 116 (b): class YES
##
##
## Attribute usage:
##
## 100.00% thal
## 83.33% oldpeak
## 69.17% ca
## 26.67% cp
## 8.75% sex
## 5.83% ï..age
## 5.42% chol
## 3.33% restecg
## 2.08% exang
## 2.08% trestbps
##
##
## Time: 0.0 secs
# thal is the principal attribute
# 0 is
data2_train$thal = NULL
data2_test$thal = NULL
data2_dt <- C5.0(data2_train[-1], data2_train$target)
#plot will not help but try
plot(data2_dt)
# thal was removed because it is not the first thing donte to apatient
# sthal is a stress tet that is further down the road
# if you had a mole and doc suggests chemotherapy
data2_train$thalach = NULL
data2_test$thalach = NULL
data2_dt <- C5.0(data2_train[-1], data2_train$target)
#plot will not help but try
plot(data2_dt)
data2_train$ca = NULL
data2_test$ca = NULL
data2_dt <- C5.0(data2_train[-1], data2_train$target)
#plot will not help but try
plot(data2_dt)
data2_train$oldpeak = NULL
data2_test$oldpeak = NULL
data2_dt <- C5.0(data2_train[-1], data2_train$target)
#plot will not help but try
plot(data2_dt)
data2_train$slope = NULL
data2_test$slope = NULL
data2_dt <- C5.0(data2_train[-1], data2_train$target)
#plot will not help but try
plot(data2_dt)
#prestest prob vs post test prob
data2_pred <- predict(data2_dt, data2_test)
CrossTable(data2_test$target, data2_pred,
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
dnn = c('actual target', 'predicted target'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 63
##
##
## | predicted target
## actual target | NO | YES | Row Total |
## --------------|-----------|-----------|-----------|
## NO | 26 | 4 | 30 |
## | 0.413 | 0.063 | |
## --------------|-----------|-----------|-----------|
## YES | 10 | 23 | 33 |
## | 0.159 | 0.365 | |
## --------------|-----------|-----------|-----------|
## Column Total | 36 | 27 | 63 |
## --------------|-----------|-----------|-----------|
##
##
confusionMatrix(data2_pred, as.factor(data2_test$target))
## Confusion Matrix and Statistics
##
## Reference
## Prediction NO YES
## NO 26 10
## YES 4 23
##
## Accuracy : 0.7778
## 95% CI : (0.6554, 0.8728)
## No Information Rate : 0.5238
## P-Value [Acc > NIR] : 2.846e-05
##
## Kappa : 0.5586
##
## Mcnemar's Test P-Value : 0.1814
##
## Sensitivity : 0.8667
## Specificity : 0.6970
## Pos Pred Value : 0.7222
## Neg Pred Value : 0.8519
## Prevalence : 0.4762
## Detection Rate : 0.4127
## Detection Prevalence : 0.5714
## Balanced Accuracy : 0.7818
##
## 'Positive' Class : NO
##
accuracy = (26+23)/63
accuracy
## [1] 0.7777778
################################################# create another decision tree by sex split
#Splitiing the view of the tree
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
myTree2 <- C50:::as.party.C5.0(data2_dt)
branch1 = plot(myTree2[2])
branch2 = plot(myTree2[20])
Thal description Nuclear stress testing requires the injection of a tracer, commonly technicium 99M (Myoview or Cardiolyte), which is then taken up by healthy, viable myocardial cells. A camera (detector) is used afterwards to image the heart and compare segments. A coronary stenosis is detected when a myocardial segment takes up the nuclear tracer at rest, but not during cardiac stress. This is called a “reversible defect.” Scarred myocardium from prior infarct will not take up tracer at all and is referred to as a “fixed defect.”